The ZooScore dataset compiles ZooScores assigned to a variety of pathogens and parasites recorded in the Global Mammal Parasite Database (GMPD). The image below shows the decision tree used to calculate a ZooScore, which ranges from -1 (a pathogen not found in humans) to 3 (a pathogen capable of human-to-human transmission, e.g., SARS-CoV-2).
The first step I took was to thoroughly understand the dataset by creating various visuals. These included simple bar graphs to display counts, box plots to visualize distributions, and summary statistics to gain insights into the dataset’s overall characteristics.
visdat::vis_dat(ZOO)+coord_flip()+scale_fill_viridis_d()+
theme(axis.text.x = element_text(angle = 0, hjust = 1))
The dataset has 28 columns and 2008 rows. Each column is a variable
describing a parasite or the ZooScore that the investigators assigned to it,
and each row represents one parasite. Because parasite_corrected_name
serves as the index, the number of rows should equal the number of
unique values of parasite_corrected_name. To verify this,
I counted the distinct values of parasite_corrected_name.
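A minimal sketch of that check, run on a small made-up data frame instead of ZOO. Only the column name parasite_corrected_name comes from the dataset; the toy rows and the object names are assumptions for illustration. Base R is used so the sketch runs without extra packages.

```r
# Toy stand-in for ZOO (hypothetical values, one deliberate duplicate).
zoo_toy <- data.frame(
  parasite_corrected_name = c("parasite_a", "parasite_b",
                              "parasite_c", "parasite_a"),
  xc_zoo_score = c(3, 2, 1, 3),
  stringsAsFactors = FALSE
)

n_rows           <- nrow(zoo_toy)
n_distinct_names <- length(unique(zoo_toy$parasite_corrected_name))

# If the name column is a true index, the two counts are equal.
is_index <- n_rows == n_distinct_names

# Names that occur more than once, so they can be inspected by hand.
dup_names <- unique(
  zoo_toy$parasite_corrected_name[duplicated(zoo_toy$parasite_corrected_name)]
)
```

On the real data, the same comparison would use nrow(ZOO) and length(unique(ZOO$parasite_corrected_name)).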
Several variables have a large number of missing values, in particular
insect, commensal, xc_notes,
pgf_zoo_score, pgf_c_score,
pgf_notes, notes, print_ref,
xc_citation, pgf_citation,
pgf_more_citations, and nematode.
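The share of missing values per column can also be computed directly, which complements the visual overview. The sketch below uses a made-up data frame; the toy values and the 0.5 cutoff are assumptions, not figures from the dataset.

```r
# Toy stand-in for ZOO with a few missing values (hypothetical data).
zoo_toy <- data.frame(
  parasite_corrected_name = c("parasite_a", "parasite_b",
                              "parasite_c", "parasite_d"),
  xc_zoo_score = c(3, NA, 1, 2),
  pgf_notes    = c(NA, NA, NA, "checked"),
  stringsAsFactors = FALSE
)

# Fraction of NA values in each column, sorted from most to least missing.
na_share <- sort(colMeans(is.na(zoo_toy)), decreasing = TRUE)

# Columns whose missing share exceeds an arbitrary illustrative cutoff.
mostly_missing <- names(na_share)[na_share > 0.5]
```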
As I delved deeper into the data, I recognized the need to give it more context. To do so, I merged the ZooScore dataset with several related sources. The Gideon Pathogens-Species-Disease dataset let me connect pathogens, host species, and diseases, and the Gideon Disease Traits dataset added disease-level attributes. To better understand the animal groups, I used the Mammal Taxonomy Dictionary dataset. To visualize geographical distribution, I incorporated the Natural Earth dataset.
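The joins in this analysis key on the parasite name, so it helps to check which names fail to match. A minimal sketch with made-up tables follows. The column names ParasiteGMPD and parasite_corrected_name are the ones used in the joins in this report; the toy rows are assumptions. Base merge() with all.x = TRUE is used here as a package-free equivalent of a left join.

```r
# Toy stand-ins for the Gideon table and the ZooScore table (hypothetical rows).
gid_toy <- data.frame(
  ParasiteGMPD = c("parasite_a", "parasite_b", "parasite_x"),
  species      = c("host_1", "host_2", "host_3"),
  stringsAsFactors = FALSE
)
zoo_toy <- data.frame(
  parasite_corrected_name = c("parasite_a", "parasite_b", "parasite_c"),
  xc_zoo_score            = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of gid_toy, like a left join.
joined <- merge(gid_toy, zoo_toy,
                by.x = "ParasiteGMPD", by.y = "parasite_corrected_name",
                all.x = TRUE)

# Rows without a ZooScore after the join point to unmatched parasite names.
unmatched <- joined$ParasiteGMPD[is.na(joined$xc_zoo_score)]
```

Unmatched names are worth reviewing before plotting, because they drop out of any step that filters on a non-missing ZooScore.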
xc_c_score & xc_zoo_score
The xc_c_score is the cross-checked confidence score, assigned after review by multiple individuals. It expresses the confidence level in the ZooScore, with 1 indicating high confidence and 3 indicating low or no confidence. The xc_c_score column is more complete than confidence_score, with fewer missing (NA) values, and all of its values fall within the expected range.
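Both claims, fewer missing values and no out-of-range scores, can be checked with a few lines. The sketch below uses a made-up data frame; the column names come from the dataset, while the values and helper object names are assumptions.

```r
# Toy stand-in with the two confidence columns (hypothetical values).
zoo_toy <- data.frame(
  confidence_score = c(1, NA, NA, 2, NA),
  xc_c_score       = c(1, 2, NA, 2, 3)
)

# Number of missing values in each confidence column.
na_counts <- colSums(is.na(zoo_toy[, c("confidence_score", "xc_c_score")]))

# Non-missing xc_c_score values that fall outside the 1-3 scale.
out_of_range   <- !is.na(zoo_toy$xc_c_score) & !(zoo_toy$xc_c_score %in% 1:3)
n_out_of_range <- sum(out_of_range)
```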
# Count distinct parasites for each ZooScore / confidence-score pair
ZOO %>%
  group_by(xc_zoo_score, xc_c_score) %>%
  summarise(n_row = length(unique(na.omit(parasite_corrected_name))),
            .groups = "drop") %>%
  ggplot(aes(x = xc_zoo_score, y = xc_c_score)) +
  geom_tile(aes(fill = n_row), color = "black", size = 0.6) +
  # Label colour switches for tiles above the median count
  geom_label(mapping = aes(label = n_row, color = n_row > median(n_row)),
             size = 2.5) +
  scale_color_manual(guide = "none",
                     values = c("TRUE" = "#D31245", "FALSE" = "#091F40")) +
  scale_fill_continuous() +
  theme_bw() +
  scale_x_continuous(breaks = -2:3, expand = c(0, 0)) +
  scale_y_continuous(breaks = 1:3, expand = c(0, 0)) +
  labs(x = "ZooScore", y = "Confidence Score", fill = "Num") +
  theme(aspect.ratio = 0.3,
        axis.text.x = element_text(angle = 0, hjust = 1)) +
  guides(fill = guide_colorbar(ticks = TRUE,
                               ticks.colour = "black",
                               ticks.linewidth = 1,
                               frame.colour = "black",
                               frame.linewidth = 1,
                               barwidth = 1,
                               barheight = 7))
# Split the comma-separated parasite names, trim whitespace, and count occurrences
data.frame(table(gsub("^\\s+|\\s+$", "", unlist(strsplit(GID$ParasiteGMPD, ","))))) %>%
  left_join(selected, by = c("Var1" = "parasite_corrected_name")) %>%
  group_by(Var1) %>%
  summarise(species_richness = Freq,
            xc_zoo_score = mean(xc_zoo_score)) %>%
  filter(!is.na(xc_zoo_score)) %>%
  ggplot() +
  # Individual parasites as jittered points
  geom_jitter(aes(x = xc_zoo_score,
                  y = species_richness,
                  color = as.factor(xc_zoo_score)),
              width = 0.3,
              show.legend = FALSE) +
  # One box per ZooScore; outliers hidden because the points already show them
  geom_boxplot(aes(group = xc_zoo_score,
                   y = species_richness,
                   x = xc_zoo_score,
                   color = as.factor(xc_zoo_score)),
               alpha = 0.3,
               outlier.alpha = 0,
               show.legend = FALSE,
               width = 0.4) +
  labs(x = "ZooScore", y = "Species richness") +
  scale_color_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = 1:3, expand = c(0.1, 0.1)) +
  scale_y_continuous(labels = scales::comma) +
  theme(panel.background = element_rect(fill = "white", color = "black"),
        panel.grid.major = element_line(color = "grey80"),
        aspect.ratio = 0.8)
library(patchwork)
google+wofs
p_all <- GID %>%
  left_join(MDD[, c("species", "order")], by = "species") %>%
  left_join(ZOO, by = c("ParasiteGMPD" = "parasite_corrected_name")) %>%
  # Count distinct species per order and ZooScore
  group_by(order, xc_zoo_score) %>%
  summarise(species_richness = length(unique(species)), .groups = "drop") %>%
  filter(!is.na(xc_zoo_score), !is.na(order)) %>%
  mutate(order = tools::toTitleCase(tolower(order))) %>%
  ggplot(aes(x = xc_zoo_score, y = order)) +
  geom_tile(aes(fill = species_richness), color = "black", size = 0.6) +
  # Label colour switches for tiles above the mean richness
  geom_label(mapping = aes(label = species_richness,
                           color = species_richness > mean(species_richness)),
             size = 3) +
  scale_color_manual(guide = "none",
                     values = c("TRUE" = "#D31245", "FALSE" = "#091F40")) +
  scale_fill_continuous() +
  theme_bw() +
  scale_x_continuous(breaks = -2:3, expand = c(0, 0)) +
  # scale_y_continuous(expand = c(0, 0),
  #                    breaks = 1:3) +
  labs(x = "ZooScore", y = "Order", fill = "species_richness") +
  # theme(aspect.ratio = 0.5) +
  theme(axis.text.x = element_text(angle = 0, hjust = 1)) +
  guides(fill = guide_colorbar(ticks = TRUE,
                               ticks.colour = "black",
                               ticks.linewidth = 1,
                               frame.colour = "black",
                               frame.linewidth = 1,
                               barwidth = 1,
                               barheight = 7))
p_all
While exploring and contextualizing the data, I identified specific areas of interest. I was particularly drawn to pathogens with higher ZooScores, since a higher score indicates greater potential for infection and transmission in humans. I also identified the orders represented by a large number of species: Rodentia, Carnivora, and Artiodactyla. These areas became the foundation for my subsequent analyses.
#TOP 3 Orders
p_rodent + theme(axis.title.x = element_blank()) + p_carnivora + theme(axis.title.x = element_text(face = "bold"), axis.title.y = element_blank()) + p_artiodactyla + theme(axis.title = element_blank())
To narrow my focus, I worked with a subset of the data, concentrating on three pathogens that aligned with my areas of interest: Toxoplasma gondii, Borrelia burgdorferi, and Hymenolepis diminuta. Focusing on these pathogens let me examine their associated attributes in more depth.
Finally, I explored patterns and trends within the chosen pathogens, examining how they relate to factors such as animal groups and countries. This analysis surfaced insights that could inform further research and decision-making.
A Sankey plot shows the relationships between pathogens with ZooScore = 3, their respective groups, and associated diseases.
library(ggsankey)
# Chart 1
SZ3L %>%
  ggplot(aes(x = x,
             next_x = next_x,
             node = node,
             next_node = next_node,
             fill = factor(node),
             label = node)) +
  geom_sankey(flow.alpha = 0.8, node.color = "black", show.legend = TRUE) +
  geom_sankey_label(size = 3, color = "black", fill = "white", hjust = 0) +
  theme_bw() +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  scale_fill_viridis_d(option = "inferno") +
  labs(title = "Animal groups-Selected Pathogens-Countries", fill = "Nodes")
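SZ3L is not defined in this section. The aesthetics used above (x, next_x, node, next_node) suggest a long table with one row per record and stage, which is the layout that ggsankey::make_long() returns. The base R sketch below builds such a table by hand from a made-up wide data frame, only to show the shape. The helper to_long(), the stage names group, pathogen, and country, and all values are assumptions.

```r
# Toy wide table: one row per record, one column per Sankey stage (hypothetical).
wide_toy <- data.frame(
  group    = c("order_1", "order_2"),
  pathogen = c("parasite_a", "parasite_a"),
  country  = c("country_1", "country_2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each stage, pair every node with the node in the
# following stage; the last stage has no successor, so its next_* values are NA.
to_long <- function(df, stages) {
  pieces <- lapply(seq_along(stages), function(i) {
    has_next <- i < length(stages)
    data.frame(
      x         = stages[i],
      node      = as.character(df[[stages[i]]]),
      next_x    = if (has_next) stages[i + 1] else NA_character_,
      next_node = if (has_next) as.character(df[[stages[i + 1]]]) else NA_character_,
      stringsAsFactors = FALSE
    )
  })
  long <- do.call(rbind, pieces)
  # Keep the stages in their intended left-to-right order.
  long$x <- factor(long$x, levels = stages)
  long
}

long_toy <- to_long(wide_toy, c("group", "pathogen", "country"))
```

Each wide row contributes one long row per stage, so two records across three stages give six rows.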
# Set the option before building the map so that it applies to map1 as well
mapviewOptions("basemaps.color.shuffle" = FALSE)
map1 <- CNTY %>% # uses rnaturalearth data to grab an sf object of all countries
  left_join(VALS1, by = c("name" = "Var1")) %>% # join the GID table data
  mapview(zcol = "Freq") # mapview colored by frequency
map1
VALS2 <- data.frame(table(SBOR$country))
map2<-CNTY %>% #uses rnaturalearth data to grab an sf object of all countries
left_join(VALS2, by = c("name" = "Var1")) %>% #join the GID table data
mapview(zcol = "Freq") #mapview colored by frequency
mapviewOptions("basemaps.color.shuffle" = FALSE)
map2
VALS0 <- data.frame(table(GID$country))
map3<-CNTY%>% #uses rnaturalearth data to grab an sf object of all countries
left_join(VALS0, by = c("name" = "Var1")) %>% #join the GID table data
mapview(zcol = "Freq") #mapview colored by frequency
map3
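The VALS tables above come from data.frame(table(...)), which yields the columns Var1 and Freq used in the joins. A country only receives a value on the map if its spelling matches the name column of the country layer, so a quick check of unmatched names is useful. The sketch below uses made-up records and a made-up vector of map names; both are assumptions, not the actual contents of GID or CNTY.

```r
# Toy records with a country column (hypothetical values, one missing).
recs_toy <- data.frame(
  country = c("Brazil", "Brazil", "USA", "Kenya", NA),
  stringsAsFactors = FALSE
)

# table() drops NA by default; data.frame() names the columns Var1 and Freq.
vals <- data.frame(table(recs_toy$country))
vals$Var1 <- as.character(vals$Var1)

# Hypothetical stand-in for the name column of the country layer.
map_names <- c("Brazil", "Kenya", "United States of America")

# Countries in the counts that would not match any map polygon.
not_on_map <- setdiff(vals$Var1, map_names)
```

Any name returned in not_on_map would need recoding before the join, otherwise that country shows up with a missing Freq.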